By: Saurabh Hinduja
This section contains information about the dataset selected, reasons why the dataset was selected
Dataset Selected: Propser
What is Propser
There are many peer to peer networks, like file sharing, meeting new friends and many more. Prosper is America’s first and leading peer-to-peer lending network. In this kind of network, lenders research borrowers and lend them money directly.
Prosper works because it has lower rate of interest than banks and multiple lenders can contribute to the same borrower. Therefore, it is lucrative for borrowers because it has lower rate of interest and lenders like it cause it reduces their risk and can contribute towards a bigger loan amount.
Why Prosper dataset
I have used a lot of peer-to-peer networks, usually for file sharing or for meeting new people. When I saw Prosper on the Udacity website, I was intrigued on how peer-to-peer network can be used for financial transactions (mainly borrowing/lending money). I wanted tofind out how Prosper authenticates borrowers and keeps track of all the transactions. A few questions I had were: What is the repayment rate? Is there any failure to return? What factors which affect the credibility of the borrower? Factors which influenced default?
This dataset is part of [Udacity data set options] (https://docs.google.com/document/d/1qEcwltBMlRYZT-l699-71TzInWfk4W9q5rTCSvDVMpc/pub?embedded=true) for the assignment in R which is part of their Data Analysis Nanodegree.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
Below are the steps which will be followed for analysing the Prosper dataset:
An indepth description of the variabls are available from Udacity. After spending considerable amount of time studying the variables, 20 variables were selected for further analysis.
The following are the features of interest for this data analysis:
1. Term Length of the loan expressed in months
2. LoanStatus The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
3. BorrowerState The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
4. ListingCategory The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
5. CreditScoreRangeLower The lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
6. CreditScoreRangeUpper The upper value representing the range of the borrower’s credit score as provided by a consumer credit rating agency.
7. BankcardUtilization The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
8. IncomeRange The income range of the borrower at the time the listing was created.
9. LoanOriginalAmount The origination amount of the loan.
10. Investors The number of investors that funded the loan.
11. ListingCreationDate The date the listing was created.
12. Occupation The Occupation selected by the Borrower at the time they created the listing.
13. IsBorrowerHomeowner A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
14. BorrowerAPR The Borrower’s Annual Percentage Rate (APR) for the loan.
15. BorrowerRate The Borrower’s interest rate for this loan.
16. Recommendations Number of recommendations the borrower had at the time the listing was created.
17. TotalProsperLoans Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
18. DebtToIncomeRatio The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
19. StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.
20. EmploymentStatus The employment status of the borrower at the time they posted the listing.
There were more varaibles of interest, but they were not consistent (For example: CreditGrade was availale only pre-2009)
Exploratory data analysis is divided into 3 sections, depending on number of data variables taken for analysis. (Univariable, Bivariate and Multivariate)
The first step for exploratory data analysis is to check verify that the dataset has 81 variable and 113,937 variable. Also, we will be subsetting the dataset and forming a new dataset which will just have the variables for our interest.
Verify the dimensions
dim(prosper_data)
[1] 113937 81
After verification that the dataset is complete, we will now subset the dataset
# First we create a list of variables to subset
VarstoSubset <- c("Term","LoanStatus","BorrowerState","ListingCategory..numeric.",
"CreditScoreRangeLower","CreditScoreRangeUpper",
"BankcardUtilization","IncomeRange", "TotalProsperLoans",
"LoanOriginalAmount", "Investors","ListingCreationDate",
"Occupation","IsBorrowerHomeowner","BorrowerAPR","BorrowerRate",
"Recommendations","DebtToIncomeRatio", "StatedMonthlyIncome",
"EmploymentStatus")
## Now we create the data subset
prosperSubSet <- subset(prosper_data, select = VarstoSubset)
## Check the completeess of the data
dim(prosperSubSet)
[1] 113937 20
## Check the Variables and their data types
str(prosperSubSet)
'data.frame': 113937 obs. of 20 variables:
$ Term : int 36 36 36 36 36 60 36 36 36 36 ...
$ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
$ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
$ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
$ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
$ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
$ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
$ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
$ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
$ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
$ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
$ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
$ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
$ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
$ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
$ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
$ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
$ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
$ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
$ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
We verified that the subset has 113937 rows and 20 variables alpng with their data types.
Before we continue with our data analysis we need to define who is a borrower in default. The propser dataset has done a very good job in catagorizing the dataset, but if dig into the definations of “Chargedoff” and “Defaulted” we find that their are still a few more rows which should fall into defaulted (For example: people who Past Due(>61 days) )
Borrowers in Default
Creating a new variable DefaultBorrowers (1 - Person is in Default, 0 - Not in Default)
prosperSubSet$DefaultBorrowers <- ifelse(
prosperSubSet$LoanStatus == "Defaulted" |
prosperSubSet$LoanStatus == "Chargedoff" |
prosperSubSet$LoanStatus == "Past Due (61-90 days)" |
prosperSubSet$LoanStatus == "Past Due (91-120 days)"|
prosperSubSet$LoanStatus == "Past Due (>120 days)",
1,0)
str(prosperSubSet)
'data.frame': 113937 obs. of 21 variables:
$ Term : int 36 36 36 36 36 60 36 36 36 36 ...
$ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
$ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
$ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
$ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
$ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
$ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
$ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
$ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
$ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
$ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
$ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
$ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
$ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
$ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
$ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
$ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
$ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
$ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
$ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
$ DefaultBorrowers : num 0 0 0 0 0 0 0 0 0 0 ...
The first thing we will see that if there there were some economic crisis which caused the an increase in loan defaults.
# Year in which loan was taken
prosperSubSet$LoanTakenYear <- format (
as.Date(prosperSubSet$ListingCreationDate), "%Y"
)
# Plot the new variable
ggplot(prosperSubSet, aes(prosperSubSet$LoanTakenYear, fill = factor(DefaultBorrowers))) + geom_bar() + scale_fill_discrete( "Kind of borrower", labels = c ("Good Borrowers", "Default Borrowers")) + ggtitle("Borrowers over the years") + labs(x = "Number of Loans through the years")
There are few important things to note here:
The ratio of defaulters was high in 2007,2008
There was a sudden dip in number of loans taken in 2009. After a bit of research we can see that there was an economic depression in 2009.
There was a exponential increase number of loans taken fron 2009 to 2013.
There was a dip again in 2014. Again, a bit of searching showed that there was an economic slowdown in 2014
There was a a linear increase in number of defualters from 2009 to 2012
It should be kept in mind that the above figures are only on new loans taken in that year. It does not include the on-going loans during that period.
Loan Term and Loan Status
Now we should look into continuing loans. The status of the loan and its status.
For this we will break down terms into 4 terms.
Start - 0 months
1 year - 12 months
3 years - 36 months
5 years - 60 months
# Check summary of loan terms
summary(prosperSubSet$Term)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.00 36.00 36.00 40.83 36.00 60.00
#Check summary of default borrowers
summary(prosperSubSet$DefaultBorrowers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1548 0.0000 1.0000
#Check summary of Loan status
summary(prosperSubSet$LoanStatus)
Cancelled Chargedoff Completed Current Defaulted FinalPaymentInProgress Past Due (>120 days)
5 11992 38074 56576 5018 205 16
Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
806 265 363 313 304
#Create graphs
ggplot(prosperSubSet, aes(Term, fill = factor(DefaultBorrowers))) +geom_bar() +
scale_x_continuous(breaks = c(0,12,36,60)) +
scale_fill_discrete("Kind of Borrowers", labels = c("Good Borrowers", "Default Borrowers")) +
ggtitle("Loan Terms") + labs(x = "Loan Terms")
From the summary we can observe the following: 1. Median length of loan is 36 months 2. 15% of the borrowers are in default
Loan amount and investors
Now let us check the details on the original loan amount and investors interested in giving loans.
#Check summary for laoan amount
summary(prosperSubSet$LoanOriginalAmount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1000 4000 6500 8337 12000 35000
#Check summary for average investors
summary(prosperSubSet$Investors)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 44.00 80.48 115.00 1189.00
#Plot loan ammount
loanterm <- ggplot(prosperSubSet, aes(LoanOriginalAmount, fill = factor(DefaultBorrowers))) +
scale_fill_discrete("Kind of Borrowers", labels = c ("Good Borrowers", "Default Borrowers")) +
ggtitle("Loan Amount") + labs(x = "Loan Amount")
loanterm + geom_histogram(binwidth = 1000)
From the above information it can be noted that: 1. The median loan amount is $6,500. Average loan amount is $8,337. The loan range is $1000 to $35,000 with a third quantile at $12,000. 2. Median number of investors for the loan are 44, with third quantile on 115 and maximum number of investors being $1189, this could be an outlier. 3. The graph for borrowers (good and default) is right skewed. 4. The high number of borrowers at $10,000 , $15,000, $20,000, $25,000 pull the average to be higher than median
Now, let us look deeper into the graph and try to eliminate some outliers by examining just 95% of the data.
loanterm +
geom_histogram(binwidth = 500) +
scale_x_continuous( limits = c (500, quantile(prosperSubSet$LoanOriginalAmount, 0.95)),
breaks = c(0,2000,4000,6000,8000,10000,12000,14000,16000,18000,20000))
We can see that below the mean, that is, $8,337 people request in the all kinds of amounts in the range. But, above $8,337 people request in “whole” values, of $10,000 and $15,000.
Income Range
Analyzing the income range of borrowers.
#Summary of Income range
summary(prosperSubSet$IncomeRange)
$0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999 $75,000-99,999 Not displayed Not employed
621 7274 17337 32192 31050 16916 7741 806
#Ploting the income range
plotincomerange <- ggplot(prosperSubSet, aes(IncomeRange, fill = factor(DefaultBorrowers))) +
scale_x_discrete(limits = c ("Not employed", "$0", "Not displayed",
"$1-24,999", "$25,000-49,999", "$50,000-74,999",
"$75,000-99,999", "$100,000+")) +
geom_bar() +
scale_fill_discrete(labels = c("Good Borrowers", "Bad Borrowers")) +
ggtitle("Income Range") + labs(x = "Income Range") +
theme(axis.text.x = element_text(angle = 90))
plotincomerange
NA
From the above data it can bee seen that: 1. Majority of the borrowers earn between $25,000 to $75,000 2. There are people earning more than $100,000 who have taken loans 3. The greatest ratio of Bad borrowers to total borrowers is of “Not Displayed”. It can be assummed that borrowers in “Not displayed” also have some income. 4. The second maximum number of “Bad borrowers” fall in the income range of $25,000 - $50,000
Borrowers State
Now, we plot the state in which the borrowers live. This is done to check if there is a pattern in place where bad borrowers live.
options(warn=-1)
#Summary of borrowers across the states
summary(prosperSubSet$BorrowerState)
AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC
5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593 2318 2615 787 330 3084
ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
52 674 551 3097 472 1090 6729 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207 3048 1842 391 150
#Inital graph of borrowers
initialstateplot <- ggplot(na.omit(prosperSubSet), aes(BorrowerState, fill = factor(DefaultBorrowers))) +
geom_bar(position = "dodge") +
scale_fill_discrete(labels = c("Good Borrowers", "Bad Borrowers")) +
ggtitle("Borrowers State") + labs(x = "Borrowers State") +
theme(axis.text.x = element_text(angle = 90))
initialstateplot
#Mapping States to a map
prosperSubSet$BorrowerStateFullName <- tolower(state.name
[match(prosperSubSet$BorrowerState, state.abb)])
usmap <- map_data("state")
usstates <- data.frame(state.center, state.abb)
#Mapping bad borrowers in each state
bad_borrow_state <- prosperSubSet %>%
group_by(BorrowerStateFullName, BorrowerState, DefaultBorrowers) %>%
summarise(count = n()) %>%
filter(DefaultBorrowers == 1)
DefaultMap <- ggplot() +
geom_map(data = usmap, map =usmap, aes(x = long, y = lat, map_id = region, label = region))+
geom_map(data = na.omit(bad_borrow_state), map=usmap, aes(fill = count, map_id = BorrowerStateFullName)) +
geom_text(data = usstates, aes(x = x, y = y, label = state.abb, group = NULL)) +
scale_fill_continuous(low ='blue', high='red', guide = 'colorbar')+
labs(x="Bad Borrowers", y=NULL) +
coord_map("albers", lat0= 39, lat1 = 45)
Ignoring unknown aesthetics: x, y, label
DefaultMap
From the bar graph it can be seen: 1. California has the maximum number of borrower, 14717 and it can pull all the statistics towards it. 2. Second highest number of borrowers are in Texas which has less than half of California, 6842
From the map it can be seen: 1. California stands out, that means it has the highest number of Bad borrowers, but this can also be due to the fact that California also has the highest number of total borrowers 2. Other Sates which stand out are Texas, Illonois, Florida and Georgia. 3. Texas, Florida and Illonois also have high number of borrower nearly 6000 4. Georgia has high number of borrowers and lower number of total borrowers. 5. Central America has lowest number of borrower and bad borrowers. Reasons for this may be lower cost of living, low population. 6. North Dakota and Wyoming have least number of borrowers
Listing Catagory s Analyzing the purpose for which the loan was taken out.
#Create a variable for Listing catagory
ListCat <- c("Not Available", "Debt Consolidation", "Home Improvement",
"Business", "Personal Loan", "Student Use", "Auto", "Other",
"Baby&Adoption", "Boat", "Cosmetic Procedure",
"Engagement Ring", "Green Loans","Household Expenses",
"Large Purchases", "Medical/Dental", "Motorcycle",
"RV", "Taxes", "Vacation", "Wedding Loans")
#Mapping the new variable
prosperSubSet$ListingCategoryName <- ListCat[(prosperSubSet$ListingCategory..numeric.)+1]
#Plotting the Listing Category
ListCatGraph <- ggplot(prosperSubSet, aes(prosperSubSet$ListingCategoryName, fill = factor(DefaultBorrowers))) +
geom_bar() +
scale_fill_discrete("Borrowers", labels = c("Good Borrowers", "Bad Borrowers")) +
ggtitle("Income Range") + labs(x = "Income Range") +
theme(axis.text.x = element_text(angle = 90))
# Debit Consolidation overshadoewed other values due to which can't observe anything adding log
ListCatGraph + scale_y_log10()
ListCatGraph
From the above graphs it can be seen that: 1. Highest number of borrowers in Debt Consolidation, followed by Not Available 2. Debt consolidation and Not available have the same number of Bad borrowers, but Not Available has lower number of borrrowers. Therefore, Not available has higher percentage of bad borrowers. 3. None of the borrowers who borrowed money for RV are in default. 4. Debt consolidation are people who have a lot of loan from a lot of debt from multiple sources.
Now to look at default borrowers let us dive into bivariate analysis
Debt to Income Ratio and Stated Income
Let us look at the debt to income trtio of borrowers across the prosper dataset. We are interested in knowing if people wiht high income take higher loans or other way round and what are case of them being in default. FOr this we will make a scatter plot and consider only perople with income less than $25k
ggplot(prosperSubSet, aes(StatedMonthlyIncome,
DebtToIncomeRatio,
colour = factor(DefaultBorrowers))) +
geom_point(alpha = 0.5) +
scale_x_continuous(limits = c(0,25000)) +
#scale_y_continuous(limits = c(0,10)) +
scale_y_log10() +
scale_color_discrete(labels = c("Good Borrowers", "Bad Borrowers")) +
geom_density2d()
The log scale displays some interesting characteristics: 1. Most have debt-to-income ratio greater than 0.01 2. There are some outliers with debt-to-income ratio of 0 3. The contour lines (rings) show a high concentration of bad borrowers. 4. Changing the y limitds show us that a lot of the borrowers have an income of zero and are on Y axis all the way to 10. 5.It suggests a negative correlation between income and debt
Debt to Income and Interest Rate
Let us see what relationship does Debt-to-interest have with Interest rate
ggplot(prosperSubSet, aes( DebtToIncomeRatio, BorrowerRate)) +
geom_point(alpha =0.1) + scale_x_log10()
Points to be noted 1. Positive correlation between Borrower rate and Debt-to-income ratio
Listing Category and Loan Amount
Creating a few box blots ti examine the loan category, loan amount and borrowing rate.
ggplot(prosperSubSet, aes(ListingCategoryName,
LoanOriginalAmount,
group = ListingCategoryName)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90))
NA
From the above graph it can be noted that: 1. Debt consolidation and Baby Adoption have the highest loan amount 2. Lowest loan amounts are for Auto, Household expenses, Motorcycle, Personal loan and Student use.
Loan Category and Borrower Rate
Box plot for Loan Category and Borrower rate
ggplot(prosperSubSet, aes(ListingCategoryName, BorrowerRate, group = ListingCategoryName))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90))
Average borrowing rate is between 0.15 and 0.25 acroess all categories
Loan Term and Loan Amount
Box plot between Loan term and loan amount.
ggplot(prosperSubSet, aes(Term, LoanOriginalAmount, group = Term))+
geom_boxplot() +
scale_x_continuous(breaks = c(0,12,36,60)) #Added later to make graph better
Points to note: 1. The average loan amount for 12 months term is lowest followed by 36 and 60. This is expected as it takes longer to return a bigger loan amount.
Loan Amount and Bankcard Utilization
Let us see how do people utalize their bankcard
ggplot(prosperSubSet, aes(LoanOriginalAmount, BankcardUtilization))+
geom_point(alpha=0.1, size=0.75,position = "jitter")+
scale_y_log10()
Things to note: 1. Card utilization is higher for people taking out a lower loan 2. Spikes in plot at 10K, 15K, 20K, 25K 35K . People might be taking loand of this amount more
Loan Amount and Investore
Let us analyse if there is relation between Loan Amount and Investors
ggplot(prosperSubSet, aes(LoanOriginalAmount, Investors)) +
geom_point(alpha =0.1, position ="jitter")
Now let us do some multivariate analysis to see how they affect each other and to rule out any correlations which might look like causation.
To perform multivariate analysis we have to first calculate the pearson correlations.
Pearsons Correlations
ggcorr(prosperSubSet, label = TRUE, label_size = 2, hjust = 0.8, size = 2.5, angle =-10)
data in column(s) 'LoanStatus', 'BorrowerState', 'IncomeRange', 'ListingCreationDate', 'Occupation', 'IsBorrowerHomeowner', 'EmploymentStatus', 'LoanTakenYear', 'BorrowerStateFullName', 'ListingCategoryName' are not numeric and were ignored
Points to be noted from the correlation 1. Stong correlation (-0.5) between Credit Score and Borrower Rate 2. Correlation between credit score and default borrower of -0.3 3. Strong correlation between loan amount and number of investors 0.4 4. Correlation between Term and Loan Amount is 0.3 (This was predicted earlier) 5. If borrower APR is high number of investors is low (-0.3)
Credit Score, Loan Amount, Type of Borrower
Let us see if we can find some correlation on these 3 variables.
CLG1 <- ggplot(subset(prosperSubSet, DefaultBorrowers == 1), aes(CreditScoreRangeUpper, LoanOriginalAmount)) +
geom_point(alpha = 0.2, color = "red")+
ggtitle("Bad Borrowers")
CLG2 <- ggplot(subset(prosperSubSet, DefaultBorrowers == 0), aes(CreditScoreRangeUpper, LoanOriginalAmount)) +
geom_point(alpha = 0.2, color = "green")+
ggtitle("Good Borrowers")
grid.arrange(CLG1, CLG2, nrow=1)
A few observations from the above graphs: 1. No bad borrowers have taken loan above 25k 2. Lower the credit score, lower the credit score, higher change of borrower being bad. Usually this is below 10K 3. At higher loand amount like 25K even borrowers with high credit score of 700+ are bad borrowers 4. Averga loan amount for good borrower (Credit score above 700) is 20K
Loan Amount, Year and Income Range
Let us see if there is any difference in Loan amount taken by an income group over the years. This can show if there is any effect of inflation.
positions <- c ("Not employed", "$0", "Not displayed",
"$1-24,999", "$25,000-49,999", "$50,000-74,999",
"$75,000-99,999", "$100,000+")
prosperSubSet <- transform(prosperSubSet,
IncomeRange = factor(IncomeRange,
levels = positions,
labels = positions))
ggplot(prosperSubSet, aes(LoanTakenYear, LoanOriginalAmount)) + geom_boxplot()+
facet_grid(IncomeRange ~ .)
Interesting observations 1. Average Loan amount take has increased over the years across the income groups. This might be due to inflation 2. Dip in loan about across all income 2009-2010. 3. Higher the income group higher the average loan amount 4. Higher the income group more variance in the loan amount.
Loan Year, Loan Amount, Type of Borrower
ggplot(prosperSubSet, aes(LoanTakenYear, LoanOriginalAmount, fill = factor(DefaultBorrowers))) +
geom_boxplot() +
facet_grid(IncomeRange ~ .)+
scale_y_continuous(limits = c(0,25000))+
scale_fill_discrete(breaks = c(0,1), labels = c("Good Borrowers", "Bad Borrowers"))
Observations 1. Higgher average loan amount take by bad borrowers till 2010 2. Lower average loan amount by bad borrowers post 2010 3. No bad borrowes in 2014 (incomplete data??)
Now let us dive deeper into the dataset and get some good plots to make some reasonable observations
Plot 1
In this plot let us see the relationship between monthly income, debt-to-income ratio categorized by the type of borrowers borrowers. To reveal hidden trends we have to manage overplotting we will apply density contours
ggplot(prosperSubSet, aes(StatedMonthlyIncome, DebtToIncomeRatio, color = factor(DefaultBorrowers))) +
geom_point(alpha = 0.1) +
scale_color_discrete(labels = c("Good Borrowers", "Bad Borrowers")) +
scale_x_continuous(limits = c(0,15000),labels = scales::dollar) +
scale_y_log10()+
labs(x = "Monthly income", y="Debt to Income Ratio", title = "Debt-to-income Monthly Income by Status")+
geom_density2d()
This graph shows that there is a negative correlation bewtween monthly stated income and debt-to-income ratio. That is, the lower the income the higher the debt-to-income ratio, it should be noted that this does not mean that people with lower income have higher debt, the total amount of debt for lower income and higher income can be the same.
Plot 2
This plot we will see loan amout and loan take year catagorized by type of borrowers. But first let us modify the graph to easily understand it. Following are some of the tweaks. 1. Remove “Not Displayed” income category as it does not give us any information. 2. Print all income in thousands.
ggplot(subset(prosperSubSet, prosperSubSet$IncomeRange != "Not displayed"),
aes(LoanTakenYear,
LoanOriginalAmount/1000,
fill = factor(DefaultBorrowers))) +
geom_boxplot() +
scale_y_continuous(labels = scales::dollar) +
facet_grid(IncomeRange ~ .) +
scale_fill_discrete(name = "Borrowers",
labels = c("Good Borrower", "Bad Borrower")) +
labs(x="Year in which loan was taken", y="Loan Amount (1000)",
title = "Loan Amount, Income Range across Years with type of borrowers")
The plots show that there is high variance in loan amount take by default borrower in 2007. There was a dip in loans take in 2009 which can be attributed to the depression, The general increase in loan amount across the board over the years can be attributed to inflation.
Plot 3
For this plot we revist loans across the listing categories. Here we will use log10 for better clearity.
ggplot(prosperSubSet, aes(prosperSubSet$ListingCategoryName, fill = factor(DefaultBorrowers))) +
geom_bar(position="dodge") +
scale_fill_discrete("Borrowers", labels = c("Good Borrowers", "Bad Borrowers")) +
ggtitle("Income Range") + labs(x = "Income Range") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="Loan Category", y="Number of Borrowers (log10)",
title = "Borrowers per Loan Category ") + scale_y_log10()
We can reiterate the observation made earlier that “debt consolidation” is an outlier and most probably is a lot of kind of loans lumped into one. People who have taken loans for “RV” are not in default.
The prosper dataset is over 86 MBs with over 100k rows and 81 variables spanning over 9 years. I selected this dataset just because it was bigger and I had to think about it there werent any clear questions. This dataset posed a lot of unique challenges, some of them were, understanding the variables, understanding financial terms, selecting the variables to investigate. I have just made plots but as I did not understand a lot of factors but I would like to do more analysis like annova after studying a bit of finance and when I know what factors will make an impact in determining whether someone will be in default. For this analyses I would have to consider all the 81 variable and that might go out of the scope of this project.
The plots helped in determining the trends of various correlations and also ave me a lot of practice in ggpot in R whuch I feel would help me a lot in the future. I also realized how some outliers could bias the data, like California, Debt consolitdation; for such data it is very important to scale the data or leave the outlier as it can lead to incorrect analysis.
Overall, this project helped me raise questions on my own and helped to understand how to visalize data so that maximum information and conclusion can be dran from it.